Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.8 - Case Statements")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
| | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
| 4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None |
Case Statements
Case statements are usually used for conditional calculations, where each row is assigned a value based on the first condition it satisfies, e.g.:
- if `x` then `a`
- if `y` then `b`
- everything else `c`
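In plain Python, the pattern above is an if/elif/else chain. The following is a minimal sketch only; the function name `describe_sign` and its conditions are hypothetical and chosen purely to illustrate the shape of the logic.

```python
def describe_sign(value):
    """Hypothetical helper illustrating the if / elif / else pattern."""
    if value < 0:       # if x then a
        return "negative"
    elif value == 0:    # if y then b
        return "zero"
    else:               # everything else: c
        return "positive"


# Each input is matched against the conditions in order; the first hit wins.
labels = [describe_sign(v) for v in (-3, 0, 7)]
print(labels)
```

Spark's `F.when(...).when(...).otherwise(...)` chain follows the same first-match-wins order, but builds a column expression that is applied to every row instead of to a single Python value.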
Using Switch/Case Statements in Spark
(
    pets
    .withColumn(
        'oldness_value',
        F.when(F.col('age') <= 5, 'young')
        .when((F.col('age') > 5) & (F.col('age') <= 10), 'middle age')
        .otherwise('old')
    )
    .toPandas()
)
| | id | breed_id | nickname | birthday | age | color | oldness_value |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | young |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | middle age |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | old |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | old |
| 4 | 4 | 2 | None | 2019-01-01 10:05:10 | 13 | None | old |
What Happened?
We classified each pet as young, middle age, or old based on its age. Please don't take offense; this is merely an example.
We mapped the logic of:
- If their age is less than or equal to 5, then they are considered `young`.
- If their age is greater than 5 but less than or equal to 10, then they are considered `middle age`.
- Anyone older is considered `old`.
Summary
- We learned how to map values with case statements and how to supply a default value for rows where none of the conditions are satisfied.